Skip to content

test(connection): add regression tests for MULTI/EXEC tx-queue stall#7273

Merged
romange merged 1 commit into
mainfrom
test/multi-exec-stuck-send
May 15, 2026
Merged

test(connection): add regression tests for MULTI/EXEC tx-queue stall#7273
romange merged 1 commit into
mainfrom
test/multi-exec-stuck-send

Conversation

@romange
Copy link
Copy Markdown
Collaborator

@romange romange commented May 8, 2026

Summary

  • Refactor: replace `Connection::RegisterBreakHook`/`breaker_cb_` with a virtual
    `ConnectionContext::OnSocketError` — cleaner polymorphic dispatch, no ad-hoc
    callback registered in `Service::CreateContext()`
  • Introspection: add `phase=scheduled` to `CLIENT LIST` to surface connections
    whose coordinator fiber is stuck waiting for a shard callback inside
    `Transaction::Execute()`
  • Test: `test_multi_exec_phantom_connections` reproduces the scenario from issue
    MULTI/EXEC Transaction Queue Deadlock on Silent Client Disconnect #7272, where ghost connections appear with `addr=0.0.0.0` and `phase=scheduled`
    in `CLIENT LIST` while their transactions are queued behind a lock-holding connection

Changes

  • `src/facade/conn_context.h` — add `virtual OnSocketError(uint32_t)` to base `ConnectionContext`
  • `src/server/conn_context.h/.cc` — implement `OnSocketError` override: calls `transaction->CancelBlocking(nullptr)`
  • `src/facade/dragonfly_connection.h/.cc` — remove `BreakerCb`, `RegisterBreakHook()`,
    `breaker_cb_`; register error callback unconditionally; `BreakOnce()` delegates to `cc_->OnSocketError()`
  • `src/server/main_service.cc` — remove `RegisterBreakHook` lambda from `CreateContext()`;
    add `is_scheduled` computation in `GetContextInfo()`
  • `src/facade/service_interface.h/.cc` — rename `subscribers`→`has_subscribers`,
    `blocked`→`is_blocked`; add `is_scheduled` field to `ContextInfo`
  • `tests/dragonfly/connection_test.py` — add `test_multi_exec_phantom_connections`

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 8, 2026 12:18
@augmentcode
Copy link
Copy Markdown

augmentcode Bot commented May 8, 2026

🤖 Augment PR Summary

Summary: Adds regression coverage for issue #7272, where a blocked MULTI/EXEC reply can stall other transactions by holding a shard lock.

Changes:

  • Added test_multi_exec_stuck_send_stalls_tx to simulate a client that floods MULTI/GET 1MB/EXEC without reading, filling the TCP send buffer and blocking mid-EXEC.
  • Configures send_timeout=3 and asserts a competing transactional MULTI/SET remains blocked until the timeout forces a disconnect, then completes.
  • Added test_multi_exec_phantom_connections to reproduce “phantom” CLIENT LIST entries (addr 0.0.0.0, phase process) seen during the incident.
  • Uses raw asyncio streams and raw sockets (with SO_LINGER RST-close) to control backpressure and connection teardown precisely.
  • Runs with proactor_threads=1 to make the stall/lock interaction deterministic and observable via CLIENT LIST.

Technical Notes: These tests intentionally induce TCP backpressure (client not reading) to force server-side Send()/Write() suspension while a transaction still holds shard state/locks, mirroring the production diagnostic signals.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread tests/dragonfly/connection_test.py Outdated
Comment thread tests/dragonfly/connection_test.py
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds Python integration tests to reproduce and detect the MULTI/EXEC tx-queue stall incident described in #7272, focusing on blocked socket sends during EXEC and the resulting “phantom” connections in CLIENT LIST.

Changes:

  • Add test_multi_exec_phantom_connections to reproduce addr=0.0.0.0 + phase=process phantom connections when a clogger holds the shard lock and ghosts RST-close.
  • Add test_multi_exec_stuck_send_stalls_tx to reproduce a shard-lock stall caused by a blocked send during EXEC, and verify recovery via send_timeout.

Comment thread tests/dragonfly/connection_test.py
Comment thread tests/dragonfly/connection_test.py Outdated
Comment thread tests/dragonfly/connection_test.py Outdated
@romange romange force-pushed the test/multi-exec-stuck-send branch 5 times, most recently from 58fddef to 6e79d69 Compare May 15, 2026 10:23
…t::OnSocketError

Remove the ad-hoc breaker callback from Connection in favor of polymorphic
dispatch through ConnectionContext:

- Remove Connection::RegisterBreakHook(), BreakerCb type, and breaker_cb_ member
- Add virtual OnSocketError(uint32_t) to facade::ConnectionContext
- Implement ConnectionContext::OnSocketError to cancel blocking transactions,
  replacing the RegisterBreakHook lambda registered in Service::CreateContext()
- Register the error callback unconditionally (was gated on breaker_cb_ being set)
- Expose phase=scheduled in CLIENT LIST: coordinator fiber is blocked in
  run_barrier_.Wait() inside Transaction::Execute() while waiting for shard callback
- Add test_multi_exec_phantom_connections to reproduce the scenario from issue #7272

Signed-off-by: Roman Gershman <roman@dragonflydb.io>
@romange romange force-pushed the test/multi-exec-stuck-send branch from 6e79d69 to 6fd81ec Compare May 15, 2026 16:01
@romange romange requested a review from vyavdoshenko May 15, 2026 18:42
Copy link
Copy Markdown
Contributor

@vyavdoshenko vyavdoshenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@romange romange merged commit 944b416 into main May 15, 2026
13 checks passed
@romange romange deleted the test/multi-exec-stuck-send branch May 15, 2026 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants